AllLife Bank is a US bank that has a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).
A campaign that the bank ran last year for liability customers achieved a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise campaigns with better-targeted marketing to increase the success ratio.
As a data scientist at AllLife Bank, you have to build a model that will help the marketing department identify the potential customers who have a higher probability of purchasing a loan.
The objectives are to predict whether a liability customer will buy a personal loan, to understand which customer attributes are most significant in driving purchases, and to identify which segment of customers to target.
Data Dictionary:
- ID: Customer ID
- Age: Customer's age in completed years
- Experience: Years of professional experience
- Income: Annual income of the customer (in thousand dollars)
- ZIPCode: Home address ZIP code
- Family: Family size of the customer
- CCAvg: Average spending on credit cards per month (in thousand dollars)
- Education: Education level. 1: Undergrad; 2: Graduate; 3: Advanced/Professional
- Mortgage: Value of house mortgage, if any (in thousand dollars)
- Personal_Loan: Did this customer accept the personal loan offered in the last campaign? (0: No, 1: Yes)
- Securities_Account: Does the customer have a securities account with the bank? (0: No, 1: Yes)
- CD_Account: Does the customer have a certificate of deposit (CD) account with the bank? (0: No, 1: Yes)
- Online: Does the customer use internet banking facilities? (0: No, 1: Yes)
- CreditCard: Does the customer use a credit card issued by any other bank (excluding AllLife Bank)? (0: No, 1: Yes)

# Installing the libraries with the specified versions.
!pip install numpy==1.25.2 pandas==1.5.3 matplotlib==3.7.1 seaborn==0.13.1 scikit-learn==1.2.2 sklearn-pandas==2.2.0 -q --user
Note: After running the above cell, kindly restart the notebook kernel and run all cells sequentially from the start again.
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# Library to split data
from sklearn.model_selection import train_test_split
# To build model for prediction
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
# To tune different models
from sklearn.model_selection import GridSearchCV
# To get different metric scores
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
make_scorer,
)
import warnings
warnings.filterwarnings("ignore")
The cell above installs the specified library versions using pip.
Loan = pd.read_csv("Files/Loan_Modelling.csv")
display(Loan)
The CSV file "Loan_Modelling.csv" was successfully read into a pandas DataFrame.
The DataFrame `Loan` contains the following columns: ID, Age, Experience, Income, ZIPCode, Family, CCAvg, Education, Mortgage, Personal_Loan, Securities_Account, CD_Account, Online, CreditCard.
# copying data to another variable to avoid any changes to original data
data = Loan.copy()
data.head()
| | ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 25 | 1 | 49 | 91107 | 4 | 1.6 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 2 | 45 | 19 | 34 | 90089 | 3 | 1.5 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 3 | 39 | 15 | 11 | 94720 | 1 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | 35 | 9 | 100 | 94112 | 1 | 2.7 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | 35 | 8 | 45 | 91330 | 4 | 1.0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 |
data.tail()
| | ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4995 | 4996 | 29 | 3 | 40 | 92697 | 1 | 1.9 | 3 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4996 | 4997 | 30 | 4 | 15 | 92037 | 4 | 0.4 | 1 | 85 | 0 | 0 | 0 | 1 | 0 |
| 4997 | 4998 | 63 | 39 | 24 | 93023 | 2 | 0.3 | 3 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4998 | 4999 | 65 | 40 | 49 | 90034 | 3 | 0.5 | 2 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4999 | 5000 | 28 | 4 | 83 | 92612 | 3 | 0.8 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
# Display the shape of the dataset
print("Shape of the Loan_Modelling dataset:")
print(data.shape)
Shape of the Loan_Modelling dataset:
(5000, 14)
This prints the shape of the dataset — the number of rows and columns — confirming 5,000 customer records with 14 attributes.
# Display the data types of columns in the Loan_Modelling dataset
#print(data.dtypes)
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   ID                  5000 non-null   int64
 1   Age                 5000 non-null   int64
 2   Experience          5000 non-null   int64
 3   Income              5000 non-null   int64
 4   ZIPCode             5000 non-null   int64
 5   Family              5000 non-null   int64
 6   CCAvg               5000 non-null   float64
 7   Education           5000 non-null   int64
 8   Mortgage            5000 non-null   int64
 9   Personal_Loan       5000 non-null   int64
 10  Securities_Account  5000 non-null   int64
 11  CD_Account          5000 non-null   int64
 12  Online              5000 non-null   int64
 13  CreditCard          5000 non-null   int64
dtypes: float64(1), int64(13)
memory usage: 547.0 KB
`data.info()` provides a concise summary of the DataFrame: column names, non-null counts, data types, and memory usage. All 14 columns have 5,000 non-null entries, so there are no missing values; 13 columns are int64 and CCAvg is float64. This is a quick way to verify the structure of the dataset and spot data type inconsistencies before further preprocessing or analysis.
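As a complement to `info()`, a per-column null count makes missing data explicit. A minimal sketch on a small synthetic frame (the column names and values here are illustrative, not from the loan data):

```python
import pandas as pd
import numpy as np

# Small synthetic frame standing in for the loan data
df = pd.DataFrame({
    "Age": [25, 45, np.nan],
    "Income": [49, 34, 11],
})

# Count of missing values per column
missing = df.isnull().sum()
print(missing)  # Age: 1, Income: 0
```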
# Display the statistical summary of the Loan_Modelling dataset
print(data.describe())
ID Age Experience Income ZIPCode \
count 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000
mean 2500.500000 45.338400 20.104600 73.774200 93169.257000
std 1443.520003 11.463166 11.467954 46.033729 1759.455086
min 1.000000 23.000000 -3.000000 8.000000 90005.000000
25% 1250.750000 35.000000 10.000000 39.000000 91911.000000
50% 2500.500000 45.000000 20.000000 64.000000 93437.000000
75% 3750.250000 55.000000 30.000000 98.000000 94608.000000
max 5000.000000 67.000000 43.000000 224.000000 96651.000000
Family CCAvg Education Mortgage Personal_Loan \
count 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000
mean 2.396400 1.937938 1.881000 56.498800 0.096000
std 1.147663 1.747659 0.839869 101.713802 0.294621
min 1.000000 0.000000 1.000000 0.000000 0.000000
25% 1.000000 0.700000 1.000000 0.000000 0.000000
50% 2.000000 1.500000 2.000000 0.000000 0.000000
75% 3.000000 2.500000 3.000000 101.000000 0.000000
max 4.000000 10.000000 3.000000 635.000000 1.000000
Securities_Account CD_Account Online CreditCard
count 5000.000000 5000.00000 5000.000000 5000.000000
mean 0.104400 0.06040 0.596800 0.294000
std 0.305809 0.23825 0.490589 0.455637
min 0.000000 0.00000 0.000000 0.000000
25% 0.000000 0.00000 0.000000 0.000000
50% 0.000000 0.00000 1.000000 0.000000
75% 0.000000 0.00000 1.000000 1.000000
max 1.000000 1.00000 1.000000 1.000000
# Dropping columns from the Loan_Modelling dataset
# Note: ZIPCode, Family, and Mortgage are used in the exploratory analysis below,
# so they are kept for now; columns are dropped later, just before modeling.
# Beware that `data = data.drop(cols, axis=1, inplace=True)` would set `data` to
# None, because inplace operations return None.
The column drop is deferred: ZIPCode, Family, and Mortgage are needed for the exploratory analysis that follows, and the original statement combined assignment with `inplace=True`, which would have replaced `data` with `None`.
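A minimal sketch of the two safe `drop` idioms, on a synthetic frame (column names illustrative). Combining assignment with `inplace=True` silently sets the variable to `None`:

```python
import pandas as pd

df = pd.DataFrame({"ID": [1, 2], "Age": [25, 45], "Income": [49, 34]})

# Form 1: reassign the returned copy
df1 = df.drop(["ID"], axis=1)

# Form 2: modify in place, with no assignment
df2 = df.copy()
df2.drop(["ID"], axis=1, inplace=True)

# Both leave the remaining columns intact
print(list(df1.columns), list(df2.columns))  # ['Age', 'Income'] ['Age', 'Income']
```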
data["Experience"].unique()
array([ 1, 19, 15, 9, 8, 13, 27, 24, 10, 39, 5, 23, 32, 41, 30, 14, 18,
21, 28, 31, 11, 16, 20, 35, 6, 25, 7, 12, 26, 37, 17, 2, 36, 29,
3, 22, -1, 34, 0, 38, 40, 33, 4, -2, 42, -3, 43])
# checking for experience <0
data[data["Experience"] < 0]["Experience"].unique()
array([-1, -2, -3])
# Correcting the experience values
data["Experience"].replace(-1, 1, inplace=True)
data["Experience"].replace(-2, 2, inplace=True)
data["Experience"].replace(-3, 3, inplace=True)
data["Education"].unique()
array([1, 2, 3])
# checking the number of uniques in the zip code
data["ZIPCode"].nunique()
467
data["ZIPCode"] = data["ZIPCode"].astype(str)
print(
"Number of unique values if we take first two digits of ZIPCode: ",
data["ZIPCode"].str[0:2].nunique(),
)
data["ZIPCode"] = data["ZIPCode"].str[0:2]
data["ZIPCode"] = data["ZIPCode"].astype("category")
Number of unique values if we take first two digits of ZIPCode: 7
## Converting the data type of categorical features to 'category'
cat_cols = [
"Education",
"Personal_Loan",
"Securities_Account",
"CD_Account",
"Online",
"CreditCard",
"ZIPCode",
]
data[cat_cols] = data[cat_cols].astype("category")
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined
    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # number of rows of the subplot grid = 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a star will indicate the mean value of the column
    if bins:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins)
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)  # for histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # add median to the histogram
Overall, this summary analysis provides valuable insights into the demographic and financial characteristics of the customer base. Further analysis could focus on exploring relationships between variables such as income and credit card spending or identifying factors that influence customer decisions to take up personal loans or other financial products offered by the bank.
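One way to start on the relationships mentioned above (e.g., income versus credit-card spending) is a simple Pearson correlation. A minimal sketch on synthetic values, since the real frame is not reloaded here:

```python
import pandas as pd

# Synthetic stand-ins for the Income and CCAvg columns
df = pd.DataFrame({
    "Income": [49, 34, 11, 100, 45],
    "CCAvg": [1.6, 1.5, 1.0, 2.7, 1.0],
})

corr = df["Income"].corr(df["CCAvg"])  # Pearson correlation
print(round(corr, 3))
```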
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top
    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """
    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))
    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )
    for p in ax.patches:
        if perc:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category
        x = p.get_x() + p.get_width() / 2  # x-coordinate of the bar's center
        y = p.get_height()  # height of the bar
        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the count/percentage
    plt.show()  # show the plot
import matplotlib.pyplot as plt

def histogram_boxplot(data, column):
    fig, ax = plt.subplots(1, 2, figsize=(12, 6))
    # Histogram
    ax[0].hist(data[column], bins=20, color='skyblue', edgecolor='black')
    ax[0].set_title(f'Histogram of {column}')
    ax[0].set_xlabel(column)
    ax[0].set_ylabel('Frequency')
    # Boxplot
    ax[1].boxplot(data[column], vert=False)
    ax[1].set_title(f'Boxplot of {column}')
    ax[1].set_xlabel(column)
    plt.tight_layout()
    plt.show()

# Call the function with the data and column name
histogram_boxplot(data, "Age")
# Call the function with the data and column name
histogram_boxplot(data, "Experience")
import plotly.express as px
fig = px.histogram(data, x='Income', title='Histogram of Income', template='plotly_dark')
fig.update_layout(bargap=0.1)
fig.show()
fig = px.box(data, y='Income', title='Boxplot of Income', template='plotly_dark')
fig.show()
import plotly.express as px
fig = px.histogram(data, x='CCAvg', title='Histogram of CCAvg', template='plotly_dark')
fig.update_layout(bargap=0.1)
fig.show()
fig = px.box(data, y='CCAvg', title='Boxplot of CCAvg', template='plotly_dark')
fig.show()
import plotly.express as px
fig = px.histogram(data, x='Mortgage', title='Histogram of Mortgage', template='plotly_dark')
fig.update_layout(bargap=0.1)
fig.show()
fig = px.box(data, y='Mortgage', title='Boxplot of Mortgage', template='plotly_dark')
fig.show()
labeled_barplot(data, "Family", perc=True)
import plotly.express as px
fig = px.bar(data, x='Education', title='Barplot of Education', template='plotly_dark', color='Education')
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()
import plotly.express as px
fig = px.bar(data, x='Securities_Account', title='Barplot of Securities Account', template='plotly_dark', color='Securities_Account')
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()
import plotly.express as px
fig = px.bar(data, x='CD_Account', title='Barplot of CD Account', template='plotly_dark', color='CD_Account')
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()
import plotly.express as px
fig = px.bar(data, x='Online', title='Barplot of Online', template='plotly_dark', color='Online')
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()
import plotly.express as px
fig = px.bar(data, x='CreditCard', title='Barplot of CreditCard', template='plotly_dark', color='CreditCard')
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()
import plotly.express as px
fig = px.bar(data, x='ZIPCode', title='Barplot of ZIPCode', template='plotly_dark', color='ZIPCode')
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()
def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart
    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 5, 5))
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))  # single legend, outside the axes
    plt.show()
### function to plot distributions wrt target
def distribution_plot_wrt_target(data, predictor, target):
    fig, axs = plt.subplots(2, 2, figsize=(12, 10))
    target_uniq = data[target].unique()
    axs[0, 0].set_title("Distribution of " + predictor + " for target=" + str(target_uniq[0]))
    sns.histplot(
        data=data[data[target] == target_uniq[0]],
        x=predictor,
        kde=True,
        ax=axs[0, 0],
        color="teal",
        stat="density",
    )
    axs[0, 1].set_title("Distribution of " + predictor + " for target=" + str(target_uniq[1]))
    sns.histplot(
        data=data[data[target] == target_uniq[1]],
        x=predictor,
        kde=True,
        ax=axs[0, 1],
        color="orange",
        stat="density",
    )
    axs[1, 0].set_title("Boxplot w.r.t. target")
    sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")
    axs[1, 1].set_title("Boxplot (without outliers) w.r.t. target")
    sns.boxplot(
        data=data,
        x=target,
        y=predictor,
        ax=axs[1, 1],
        showfliers=False,
        palette="gist_rainbow",
    )
    plt.tight_layout()
    plt.show()
plt.figure(figsize=(15, 7))
sns.heatmap(data.corr(numeric_only=True), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
stacked_barplot(data, "Education", "Personal_Loan")
Personal_Loan     0    1   All
Education
All            4520  480  5000
3              1296  205  1501
2              1221  182  1403
1              2003   93  2096
------------------------------------------------------------------------------------------------------------------------
import plotly.express as px
fig = px.bar(data_frame=data, x='Family', y='Personal_Loan', color='Personal_Loan', barmode='stack')
fig.update_layout(title='Distribution of Personal Loan Status by Family Size')
fig.show()
import plotly.express as px
fig = px.bar(data_frame=data, x='Securities_Account', y='Personal_Loan', color='Personal_Loan', barmode='stack')
fig.update_layout(title='Distribution of Personal Loan Status by Securities Account')
fig.show()
import plotly.express as px
fig = px.bar(data_frame=data, x='CD_Account', y='Personal_Loan', color='Personal_Loan', barmode='stack')
fig.update_layout(title='Distribution of Personal Loan Status by CD Account')
fig.show()
import plotly.express as px
fig = px.bar(data_frame=data, x='Online', y='Personal_Loan', color='Personal_Loan', barmode='stack')
fig.update_layout(title='Distribution of Personal Loan Status by Online Banking Usage')
fig.show()
import plotly.express as px
fig = px.bar(data_frame=data, x='CreditCard', y='Personal_Loan', color='Personal_Loan', barmode='stack')
fig.update_layout(title='Distribution of Personal Loan Status by Credit Card Ownership')
fig.show()
import plotly.express as px
fig = px.bar(data_frame=data, x='ZIPCode', y='Personal_Loan', color='Personal_Loan', barmode='stack')
fig.update_layout(title='Distribution of Personal Loan Status by ZIP Code')
fig.show()
distribution_plot_wrt_target(data, "Age", "Personal_Loan")
import matplotlib.pyplot as plt
# Group by Experience and Personal_Loan columns to get the count of each combination
grouped = data.groupby(['Experience', 'Personal_Loan']).size().unstack()
# Plotting the stacked bar plot
grouped.plot(kind='bar', stacked=True, figsize=(12, 6))
plt.title('Personal Loan vs Experience')
plt.xlabel('Experience')
plt.ylabel('Count')
plt.legend(title='Personal Loan', labels=['No', 'Yes'])
plt.show()
import matplotlib.pyplot as plt

# Mean income by loan status, computed from the data rather than hard-coded values
mean_income = data.groupby('Personal_Loan')['Income'].mean()
mean_income.plot(kind='bar')
plt.title('Mean Income by Personal Loan Status')
plt.xlabel('Personal Loan')
plt.ylabel('Mean Income (thousand dollars)')
plt.xticks(rotation=0)
plt.show()
import plotly.express as px
fig = px.bar(data, x='CCAvg', color='Personal_Loan', barmode='stack')
fig.update_layout(title='Distribution of Personal Loan based on CCAvg', template='plotly_dark')
fig.show()
Q1 = data['Income'].quantile(0.25) # To find the 25th percentile
Q3 = data['Income'].quantile(0.75) # To find the 75th percentile
IQR = Q3 - Q1 # Interquartile Range (75th percentile - 25th percentile)
lower = Q1 - 1.5 * IQR # Finding lower bound for outliers
upper = Q3 + 1.5 * IQR # Finding upper bound for outliers
# Note: lower/upper were computed from Income alone but are applied here to every
# numeric column, so columns on other scales (e.g., ID) show spurious outlier rates
(
    (data.select_dtypes(include=["float64", "int64"]) < lower)
    | (data.select_dtypes(include=["float64", "int64"]) > upper)
).sum() / len(data) * 100
ID            96.28
Age            0.00
Experience     0.00
Income         1.92
Family         0.00
CCAvg          0.00
Mortgage      11.26
dtype: float64
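The `ID` column's 96% "outlier" rate is an artifact of applying Income's bounds to every column. A per-column IQR version, sketched on a small synthetic frame, computes each column's own bounds via vectorized `quantile`:

```python
import pandas as pd

# Synthetic stand-ins for two numeric columns on different scales
df = pd.DataFrame({
    "Income": [8, 39, 64, 98, 224],
    "Mortgage": [0, 0, 0, 101, 635],
})

num = df.select_dtypes(include="number")
Q1, Q3 = num.quantile(0.25), num.quantile(0.75)  # per-column quartiles
IQR = Q3 - Q1
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR

# Per-column outlier percentage, each column judged by its own bounds
outlier_pct = ((num < lower) | (num > upper)).sum() / len(num) * 100
print(outlier_pct)  # Income: 20.0, Mortgage: 20.0 on this toy data
```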
# dropping Experience as it is perfectly correlated with Age
X = data.drop(["Personal_Loan", "Experience"], axis=1)
Y = data["Personal_Loan"]
X = pd.get_dummies(X, columns=["ZIPCode", "Education"], drop_first=True)
# Splitting data in train and test sets
X_train, X_test, y_train, y_test = train_test_split(
X, Y, test_size=0.30, random_state=1
)
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
Shape of Training set :  (3500, 18)
Shape of test set :  (1500, 18)
Percentage of classes in training set:
Personal_Loan
0    0.905429
1    0.094571
Name: proportion, dtype: float64
Percentage of classes in test set:
Personal_Loan
0    0.900667
1    0.099333
Name: proportion, dtype: float64
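The class proportions above differ slightly between train and test (9.46% vs. 9.93% positives). Passing `stratify` to `train_test_split` keeps the proportions identical in both splits; a sketch on a synthetic imbalanced target:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)
y = np.array([1] * 10 + [0] * 90)  # ~10% positive class, like Personal_Loan

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)
print(y_tr.mean(), y_te.mean())  # both splits keep the 10% positive rate
```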
First, let's create functions to calculate different metrics and confusion matrix so that we don't have to use the same code repeatedly for each model.
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance
    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    # predicting using the independent variables
    pred = model.predict(predictors)
    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score
    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1},
        index=[0],
    )
    return df_perf
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages
    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
model = DecisionTreeClassifier(criterion="gini", random_state=1)
model.fit(X_train, y_train)
DecisionTreeClassifier(random_state=1)
confusion_matrix_sklearn(model, X_train, y_train)
decision_tree_perf_train = model_performance_classification_sklearn(
model, X_train, y_train
)
decision_tree_perf_train
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 1.0 | 1.0 |
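Perfect training scores across all four metrics are the classic signature of an unconstrained decision tree memorizing the training set; the honest check is test-set performance. A sketch on synthetic data (not the loan data) showing the train/test gap for a depth-unlimited tree:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5))
# Noisy target: even a perfect model cannot score 1.0 on unseen data
y = (X[:, 0] + rng.normal(scale=1.0, size=500) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)
tree_clf = DecisionTreeClassifier(random_state=1).fit(X_tr, y_tr)

train_acc = tree_clf.score(X_tr, y_tr)
test_acc = tree_clf.score(X_te, y_te)
print(train_acc, test_acc)  # training accuracy is 1.0; test accuracy is lower
```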
feature_names = list(X_train.columns)
print(feature_names)
['ID', 'Age', 'Income', 'Family', 'CCAvg', 'Mortgage', 'Securities_Account', 'CD_Account', 'Online', 'CreditCard', 'ZIPCode_91', 'ZIPCode_92', 'ZIPCode_93', 'ZIPCode_94', 'ZIPCode_95', 'ZIPCode_96', 'Education_2', 'Education_3']
plt.figure(figsize=(20, 30))
out = tree.plot_tree(
    model,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
# below code will add arrows to the decision tree splits if they are missing
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of a decision tree -
print(tree.export_text(model, feature_names=feature_names, show_weights=True))
|--- Income <= 116.50
|   |--- CCAvg <= 2.95
|   |   |--- Income <= 106.50
|   |   |   |--- weights: [2553.00, 0.00] class: 0
|   |   |--- Income > 106.50
|   |   |   |--- Family <= 3.50
|   |   |   |   |--- ID <= 4936.50
|   |   |   |   |   |--- ZIPCode_93 <= 0.50
|   |   |   |   |   |   |--- CCAvg <= 2.20
|   |   |   |   |   |   |   |--- weights: [51.00, 0.00] class: 0
|   |   |   |   |   |   |--- CCAvg > 2.20
|   |   |   |   |   |   |   |--- Education_3 <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [7.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- Education_3 > 0.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |--- ZIPCode_93 > 0.50
|   |   |   |   |   |   |--- ID <= 1627.00
|   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |--- ID > 1627.00
|   |   |   |   |   |   |   |--- weights: [3.00, 0.00] class: 0
|   |   |   |   |--- ID > 4936.50
|   |   |   |   |   |--- CreditCard <= 0.50
|   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |--- CreditCard > 0.50
|   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |--- Family > 3.50
|   |   |   |   |--- Age <= 32.50
|   |   |   |   |   |--- ZIPCode_92 <= 0.50
|   |   |   |   |   |   |--- weights: [12.00, 0.00] class: 0
|   |   |   |   |   |--- ZIPCode_92 > 0.50
|   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |--- Age > 32.50
|   |   |   |   |   |--- Age <= 60.00
|   |   |   |   |   |   |--- weights: [0.00, 6.00] class: 1
|   |   |   |   |   |--- Age > 60.00
|   |   |   |   |   |   |--- weights: [4.00, 0.00] class: 0
|   |--- CCAvg > 2.95
|   |   |--- Income <= 92.50
|   |   |   |--- CD_Account <= 0.50
|   |   |   |   |--- Age <= 26.50
|   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |--- Age > 26.50
|   |   |   |   |   |--- CCAvg <= 3.55
|   |   |   |   |   |   |--- CCAvg <= 3.35
|   |   |   |   |   |   |   |--- ID <= 509.50
|   |   |   |   |   |   |   |   |--- ID <= 402.50
|   |   |   |   |   |   |   |   |   |--- weights: [3.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- ID > 402.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |--- ID > 509.50
|   |   |   |   |   |   |   |   |--- ID <= 4541.00
|   |   |   |   |   |   |   |   |   |--- weights: [24.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- ID > 4541.00
|   |   |   |   |   |   |   |   |   |--- ID <= 4725.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |   |--- ID > 4725.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [4.00, 0.00] class: 0
|   |   |   |   |   |   |--- CCAvg > 3.35
|   |   |   |   |   |   |   |--- Family <= 3.00
|   |   |   |   |   |   |   |   |--- weights: [0.00, 5.00] class: 1
|   |   |   |   |   |   |   |--- Family > 3.00
|   |   |   |   |   |   |   |   |--- weights: [9.00, 0.00] class: 0
|   |   |   |   |   |--- CCAvg > 3.55
|   |   |   |   |   |   |--- Income <= 81.50
|   |   |   |   |   |   |   |--- weights: [43.00, 0.00] class: 0
|   |   |   |   |   |   |--- Income > 81.50
|   |   |   |   |   |   |   |--- Education_2 <= 0.50
|   |   |   |   |   |   |   |   |--- Mortgage <= 93.50
|   |   |   |   |   |   |   |   |   |--- weights: [26.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- Mortgage > 93.50
|   |   |   |   |   |   |   |   |   |--- Mortgage <= 104.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |   |--- Mortgage > 104.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [6.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- Education_2 > 0.50
|   |   |   |   |   |   |   |   |--- ID <= 2942.00
|   |   |   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- ID > 2942.00
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |--- CD_Account > 0.50
|   |   |   |   |--- weights: [0.00, 5.00] class: 1
|   |   |--- Income > 92.50
|   |   |   |--- Family <= 2.50
|   |   |   |   |--- Education_2 <= 0.50
|   |   |   |   |   |--- Education_3 <= 0.50
|   |   |   |   |   |   |--- CD_Account <= 0.50
|   |   |   |   |   |   |   |--- ID <= 349.50
|   |   |   |   |   |   |   |   |--- ID <= 197.50
|   |   |   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- ID > 197.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |--- ID > 349.50
|   |   |   |   |   |   |   |   |--- weights: [27.00, 0.00] class: 0
|   |   |   |   |   |   |--- CD_Account > 0.50
|   |   |   |   |   |   |   |--- CCAvg <= 4.75
|   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |   |   |--- CCAvg > 4.75
|   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |--- Education_3 > 0.50
|   |   |   |   |   |   |--- ZIPCode_94 <= 0.50
|   |   |   |   |   |   |   |--- ID <= 2104.00
|   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |   |   |--- ID > 2104.00
|   |   |   |   |   |   |   |   |--- weights: [7.00, 0.00] class: 0
|   |   |   |   |   |   |--- ZIPCode_94 > 0.50
|   |   |   |   |   |   |   |--- weights: [0.00, 5.00] class: 1
|   |   |   |   |--- Education_2 > 0.50
|   |   |   |   |   |--- weights: [0.00, 4.00] class: 1
|   |   |   |--- Family > 2.50
|   |   |   |   |--- Age <= 57.50
|   |   |   |   |   |--- Age <= 51.00
|   |   |   |   |   |   |--- weights: [0.00, 17.00] class: 1
|   |   |   |   |   |--- Age > 51.00
|   |   |   |   |   |   |--- Age <= 53.50
|   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |--- Age > 53.50
|   |   |   |   |   |   |   |--- weights: [0.00, 3.00] class: 1
|   |   |   |   |--- Age > 57.50
|   |   |   |   |   |--- ZIPCode_93 <= 0.50
|   |   |   |   |   |   |--- Age <= 59.50
|   |   |   |   |   |   |   |--- ZIPCode_94 <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- ZIPCode_94 > 0.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |--- Age > 59.50
|   |   |   |   |   |   |   |--- weights: [5.00, 0.00] class: 0
|   |   |   |   |   |--- ZIPCode_93 > 0.50
|   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|--- Income > 116.50
|   |--- Family <= 2.50
|   |   |--- Education_3 <= 0.50
|   |   |   |--- Education_2 <= 0.50
|   |   |   |   |--- weights: [375.00, 0.00] class: 0
|   |   |   |--- Education_2 > 0.50
|   |   |   |   |--- weights: [0.00, 53.00] class: 1
|   |   |--- Education_3 > 0.50
|   |   |   |--- weights: [0.00, 62.00] class: 1
|   |--- Family > 2.50
|   |   |--- weights: [0.00, 154.00] class: 1
# Importance of features in the tree: the importance of a feature is computed as the
# (normalized) total reduction of the split criterion brought by that feature (also known as the Gini importance)
print(
pd.DataFrame(
model.feature_importances_, columns=["Imp"], index=X_train.columns
).sort_values(by="Imp", ascending=False)
)
                         Imp
Income              0.298018
Family              0.257587
Education_2         0.163412
Education_3         0.147127
CCAvg               0.044768
Age                 0.029516
ID                  0.020281
CD_Account          0.017273
ZIPCode_94          0.008713
ZIPCode_93          0.004766
Mortgage            0.003236
ZIPCode_92          0.003080
CreditCard          0.002224
Online              0.000000
Securities_Account  0.000000
ZIPCode_91          0.000000
ZIPCode_95          0.000000
ZIPCode_96          0.000000
importances = model.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
from sklearn.metrics import confusion_matrix
# Define the function to create a confusion matrix for test data
def confusion_matrix_sklearn(y_true, y_pred):
cm = confusion_matrix(y_true, y_pred)
return cm
# Example usage with test data
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [0, 0, 1, 1, 0, 1]
confusion_matrix_result = confusion_matrix_sklearn(y_true, y_pred)
print(confusion_matrix_result)
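The raw array can be hard to read at a glance. As an optional variant (the row/column labels below are illustrative, not from the dataset dictionary), the same matrix can be wrapped in a labeled DataFrame:

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

def labeled_confusion_matrix(y_true, y_pred):
    """Return the confusion matrix as a DataFrame with named axes."""
    cm = confusion_matrix(y_true, y_pred)
    return pd.DataFrame(
        cm,
        index=["Actual: No Loan", "Actual: Loan"],
        columns=["Predicted: No Loan", "Predicted: Loan"],
    )

print(labeled_confusion_matrix([1, 0, 1, 1, 0, 1], [0, 0, 1, 1, 0, 1]))
```
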
[[2 0]
 [1 3]]
# Check performance on test data with the decision tree trained earlier (`model`)
decision_tree_perf_test = model_performance_classification_sklearn(model, X_test, y_test)
decision_tree_perf_test
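`model_performance_classification_sklearn` is defined earlier in the notebook and not shown in this section. For reference, a minimal sketch consistent with how it is called here — `(model, predictors, target)`, returning a one-row DataFrame of Accuracy, Recall, Precision, and F1 — might look like:

```python
import pandas as pd
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score

def model_performance_classification_sklearn(model, predictors, target):
    """Compute accuracy, recall, precision and F1 for a fitted classifier."""
    pred = model.predict(predictors)
    return pd.DataFrame(
        {
            "Accuracy": accuracy_score(target, pred),
            "Recall": recall_score(target, pred),
            "Precision": precision_score(target, pred),
            "F1": f1_score(target, pred),
        },
        index=[0],
    )
```
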
# Choose the type of classifier.
estimator = DecisionTreeClassifier(random_state=1)
# Grid of parameters to choose from
parameters = {
"max_depth": np.arange(6, 15),
"min_samples_leaf": [1, 2, 5, 7, 10],
"max_leaf_nodes": [2, 3, 5, 10],
}
# Type of scoring used to compare parameter combinations; recall is used since
# missing a likely loan purchaser (a false negative) is the costlier error here
recall_scorer = make_scorer(recall_score)
# Run the grid search
grid_obj = GridSearchCV(estimator, parameters, scoring=recall_scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
estimator = grid_obj.best_estimator_
# Fit the best algorithm to the data.
estimator.fit(X_train, y_train)
DecisionTreeClassifier(max_depth=6, max_leaf_nodes=10, min_samples_leaf=10,
                       random_state=1)
Checking performance on training data
from sklearn.metrics import confusion_matrix
# Create the confusion matrix for the train data using the tuned estimator
y_train_predicted = estimator.predict(X_train)
confusion_matrix_train = confusion_matrix(y_train, y_train_predicted)
confusion_matrix_train
array([[2230, 229],
[ 939, 102]])
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
# Define the Decision Tree model
decision_tree = DecisionTreeClassifier(random_state=42)
# Define the parameter grid to search through
param_grid = {
'max_depth': [3, 5, 7, 9],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4]
}
# Perform Grid Search Cross Validation
grid_search = GridSearchCV(estimator=decision_tree, param_grid=param_grid, cv=5)
grid_search.fit(X_train, y_train)
# Get the best parameters and best score from the grid search
best_params = grid_search.best_params_
best_score = grid_search.best_score_
# Train a new model with the best parameters found
best_decision_tree_model = DecisionTreeClassifier(random_state=42, **best_params)
best_decision_tree_model.fit(X_train, y_train)
# Evaluate performance on train data
decision_tree_tune_perf_train = model_performance_classification_sklearn(best_decision_tree_model, X_train, y_train)
decision_tree_tune_perf_train
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 1.0 | 1.0 |
Visualizing the Decision Tree
plt.figure(figsize=(10, 10))
out = tree.plot_tree(
estimator,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
)
# below code will add arrows to the decision tree split if they are missing
for o in out:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor("black")
arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of a decision tree -
print(tree.export_text(estimator, feature_names=feature_names, show_weights=True))
|--- Income <= 116.50
|   |--- CCAvg <= 2.95
|   |   |--- Income <= 106.50
|   |   |   |--- weights: [2553.00, 0.00] class: 0
|   |   |--- Income > 106.50
|   |   |   |--- weights: [79.00, 10.00] class: 0
|   |--- CCAvg > 2.95
|   |   |--- Income <= 92.50
|   |   |   |--- weights: [117.00, 15.00] class: 0
|   |   |--- Income > 92.50
|   |   |   |--- Family <= 2.50
|   |   |   |   |--- weights: [37.00, 14.00] class: 0
|   |   |   |--- Family > 2.50
|   |   |   |   |--- Age <= 57.50
|   |   |   |   |   |--- weights: [1.00, 20.00] class: 1
|   |   |   |   |--- Age > 57.50
|   |   |   |   |   |--- weights: [7.00, 3.00] class: 0
|--- Income > 116.50
|   |--- Family <= 2.50
|   |   |--- Education_3 <= 0.50
|   |   |   |--- Education_2 <= 0.50
|   |   |   |   |--- weights: [375.00, 0.00] class: 0
|   |   |   |--- Education_2 > 0.50
|   |   |   |   |--- weights: [0.00, 53.00] class: 1
|   |   |--- Education_3 > 0.50
|   |   |   |--- weights: [0.00, 62.00] class: 1
|   |--- Family > 2.50
|   |   |--- weights: [0.00, 154.00] class: 1
# importance of features in the tree building ( The importance of a feature is computed as the
# (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance )
print(
pd.DataFrame(
estimator.feature_importances_, columns=["Imp"], index=X_train.columns
).sort_values(by="Imp", ascending=False)
)
                         Imp
Income              0.337681
Family              0.275581
Education_2         0.175687
Education_3         0.157286
CCAvg               0.042856
Age                 0.010908
ZIPCode_92          0.000000
ZIPCode_96          0.000000
ZIPCode_95          0.000000
ZIPCode_94          0.000000
ZIPCode_93          0.000000
ID                  0.000000
ZIPCode_91          0.000000
Online              0.000000
CD_Account          0.000000
Securities_Account  0.000000
Mortgage            0.000000
CreditCard          0.000000
importances = estimator.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
Checking performance on test data
from sklearn.metrics import confusion_matrix
# Create the confusion matrix for the test data using the tuned model
y_pred = best_decision_tree_model.predict(X_test)
cm_test = confusion_matrix(y_test, y_pred)
print(cm_test)
decision_tree_tune_perf_test = model_performance_classification_sklearn(best_decision_tree_model, X_test, y_test)
decision_tree_tune_perf_test
clf = DecisionTreeClassifier(random_state=1)
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
pd.DataFrame(path)
| | ccp_alphas | impurities |
|---|---|---|
| 0 | 0.000000 | 0.000000 |
| 1 | 0.000223 | 0.001114 |
| 2 | 0.000250 | 0.001614 |
| 3 | 0.000268 | 0.002688 |
| 4 | 0.000272 | 0.003232 |
| 5 | 0.000273 | 0.004868 |
| 6 | 0.000276 | 0.005420 |
| 7 | 0.000381 | 0.005801 |
| 8 | 0.000527 | 0.006329 |
| 9 | 0.000625 | 0.006954 |
| 10 | 0.000700 | 0.007654 |
| 11 | 0.000769 | 0.010731 |
| 12 | 0.000882 | 0.014260 |
| 13 | 0.000889 | 0.015149 |
| 14 | 0.001026 | 0.017200 |
| 15 | 0.001305 | 0.018505 |
| 16 | 0.001647 | 0.020153 |
| 17 | 0.002333 | 0.022486 |
| 18 | 0.002407 | 0.024893 |
| 19 | 0.003294 | 0.028187 |
| 20 | 0.006473 | 0.034659 |
| 21 | 0.025146 | 0.084951 |
| 22 | 0.039216 | 0.124167 |
| 23 | 0.047088 | 0.171255 |
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker="o", drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
plt.show()
Next, we train a decision tree using effective alphas. The last value
in ccp_alphas is the alpha value that prunes the whole tree,
leaving the tree, clfs[-1], with one node.
clfs = []
for ccp_alpha in ccp_alphas:
clf = DecisionTreeClassifier(random_state=1, ccp_alpha=ccp_alpha)
clf.fit(X_train, y_train)  # fit a decision tree for each effective alpha
clfs.append(clf)
print(
"Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
clfs[-1].tree_.node_count, ccp_alphas[-1]
)
)
Number of nodes in the last tree is: 1 with ccp_alpha: 0.04708834100596766
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]
node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2, 1, figsize=(10, 7))
ax[0].plot(ccp_alphas, node_counts, marker="o", drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker="o", drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()
Recall vs alpha for training and testing sets
recall_train = []
for clf in clfs:
pred_train = clf.predict(X_train)
values_train = recall_score(y_train, pred_train)
recall_train.append(values_train)
recall_test = []
for clf in clfs:
pred_test = clf.predict(X_test)
values_test = recall_score(y_test, pred_test)
recall_test.append(values_test)
fig, ax = plt.subplots(figsize=(15, 5))
ax.set_xlabel("alpha")
ax.set_ylabel("Recall")
ax.set_title("Recall vs alpha for training and testing sets")
ax.plot(ccp_alphas, recall_train, marker="o", label="train", drawstyle="steps-post")
ax.plot(ccp_alphas, recall_test, marker="o", label="test", drawstyle="steps-post")
ax.legend()
plt.show()
index_best_model = np.argmax(recall_test)
best_model = clfs[index_best_model]
print(best_model)
DecisionTreeClassifier(ccp_alpha=0.00027210884353741507, random_state=1)
estimator_2 = DecisionTreeClassifier(
ccp_alpha=best_model.ccp_alpha, class_weight={0: 0.15, 1: 0.85}, random_state=1
)
estimator_2.fit(X_train, y_train)
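`class_weight={0: 0.15, 1: 0.85}` upweights the minority (loan-purchaser) class during the impurity calculations. scikit-learn applies a `class_weight` dict by scaling per-sample weights, so the same tree can be reproduced with an explicit `sample_weight` array. A small sketch on synthetic data (the arrays below are illustrative, not the bank dataset):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=200) > 0.8).astype(int)

class_weight = {0: 0.15, 1: 0.85}

# Tree fitted with the class_weight argument
tree_cw = DecisionTreeClassifier(random_state=1, class_weight=class_weight)
tree_cw.fit(X_demo, y_demo)

# Equivalent tree fitted with explicit per-sample weights
sample_weight = np.where(y_demo == 1, class_weight[1], class_weight[0])
tree_sw = DecisionTreeClassifier(random_state=1)
tree_sw.fit(X_demo, y_demo, sample_weight=sample_weight)

# Both trees should make identical predictions
print(np.array_equal(tree_cw.predict(X_demo), tree_sw.predict(X_demo)))
```
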
DecisionTreeClassifier(ccp_alpha=0.00027210884353741507,
                       class_weight={0: 0.15, 1: 0.85}, random_state=1)
Checking performance on training data
from sklearn.metrics import confusion_matrix
# Create the confusion matrix for the training data using the post-pruned tree
y_train_predicted = estimator_2.predict(X_train)
confusion_matrix_train = confusion_matrix(y_train, y_train_predicted)
confusion_matrix_train
array([[2230, 229],
[ 939, 102]])
decision_tree_tune_post_train = model_performance_classification_sklearn(estimator_2, X_train, y_train)
decision_tree_tune_post_train
Visualizing the Decision Tree
plt.figure(figsize=(10, 10))
out = tree.plot_tree(
estimator_2,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
)
# below code will add arrows to the decision tree split if they are missing
for o in out:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor("black")
arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of a decision tree -
print(tree.export_text(estimator_2, feature_names=feature_names, show_weights=True))
|--- Income <= 98.50
|   |--- CCAvg <= 2.95
|   |   |--- weights: [374.10, 0.00] class: 0
|   |--- CCAvg > 2.95
|   |   |--- CD_Account <= 0.50
|   |   |   |--- CCAvg <= 3.95
|   |   |   |   |--- Income <= 81.50
|   |   |   |   |   |--- Age <= 36.50
|   |   |   |   |   |   |--- Family <= 3.50
|   |   |   |   |   |   |   |--- CCAvg <= 3.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1.70] class: 1
|   |   |   |   |   |   |   |--- CCAvg > 3.50
|   |   |   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |   |   |--- Family > 3.50
|   |   |   |   |   |   |   |--- weights: [0.60, 0.00] class: 0
|   |   |   |   |   |--- Age > 36.50
|   |   |   |   |   |   |--- ZIPCode_91 <= 0.50
|   |   |   |   |   |   |   |--- weights: [6.15, 0.00] class: 0
|   |   |   |   |   |   |--- ZIPCode_91 > 0.50
|   |   |   |   |   |   |   |--- ID <= 2043.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 0.85] class: 1
|   |   |   |   |   |   |   |--- ID > 2043.50
|   |   |   |   |   |   |   |   |--- weights: [0.45, 0.00] class: 0
|   |   |   |   |--- Income > 81.50
|   |   |   |   |   |--- ID <= 934.50
|   |   |   |   |   |   |--- weights: [1.35, 0.00] class: 0
|   |   |   |   |   |--- ID > 934.50
|   |   |   |   |   |   |--- CCAvg <= 3.05
|   |   |   |   |   |   |   |--- weights: [0.60, 0.00] class: 0
|   |   |   |   |   |   |--- CCAvg > 3.05
|   |   |   |   |   |   |   |--- Mortgage <= 162.00
|   |   |   |   |   |   |   |   |--- Securities_Account <= 0.50
|   |   |   |   |   |   |   |   |   |--- ID <= 3334.00
|   |   |   |   |   |   |   |   |   |   |--- ID <= 1748.00
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 3
|   |   |   |   |   |   |   |   |   |   |--- ID > 1748.00
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [1.05, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- ID > 3334.00
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 5.95] class: 1
|   |   |   |   |   |   |   |   |--- Securities_Account > 0.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.45, 0.00] class: 0
|   |   |   |   |   |   |   |--- Mortgage > 162.00
|   |   |   |   |   |   |   |   |--- weights: [0.45, 0.00] class: 0
|   |   |   |--- CCAvg > 3.95
|   |   |   |   |--- weights: [6.75, 0.00] class: 0
|   |   |--- CD_Account > 0.50
|   |   |   |--- ID <= 766.50
|   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |--- ID > 766.50
|   |   |   |   |--- weights: [0.00, 6.80] class: 1
|--- Income > 98.50
|   |--- Family <= 2.50
|   |   |--- Education_3 <= 0.50
|   |   |   |--- Education_2 <= 0.50
|   |   |   |   |--- Income <= 100.00
|   |   |   |   |   |--- CCAvg <= 4.20
|   |   |   |   |   |   |--- weights: [0.45, 0.00] class: 0
|   |   |   |   |   |--- CCAvg > 4.20
|   |   |   |   |   |   |--- weights: [0.00, 1.70] class: 1
|   |   |   |   |--- Income > 100.00
|   |   |   |   |   |--- Income <= 103.50
|   |   |   |   |   |   |--- Securities_Account <= 0.50
|   |   |   |   |   |   |   |--- weights: [2.10, 0.00] class: 0
|   |   |   |   |   |   |--- Securities_Account > 0.50
|   |   |   |   |   |   |   |--- ZIPCode_91 <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |   |   |   |--- ZIPCode_91 > 0.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 0.85] class: 1
|   |   |   |   |   |--- Income > 103.50
|   |   |   |   |   |   |--- weights: [64.95, 0.00] class: 0
|   |   |   |--- Education_2 > 0.50
|   |   |   |   |--- Income <= 110.00
|   |   |   |   |   |--- weights: [1.80, 0.00] class: 0
|   |   |   |   |--- Income > 110.00
|   |   |   |   |   |--- Income <= 116.50
|   |   |   |   |   |   |--- Mortgage <= 141.50
|   |   |   |   |   |   |   |--- CreditCard <= 0.50
|   |   |   |   |   |   |   |   |--- Age <= 48.50
|   |   |   |   |   |   |   |   |   |--- ID <= 675.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- ID > 675.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 2.55] class: 1
|   |   |   |   |   |   |   |   |--- Age > 48.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |   |   |   |--- CreditCard > 0.50
|   |   |   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |   |   |--- Mortgage > 141.50
|   |   |   |   |   |   |   |--- weights: [0.60, 0.00] class: 0
|   |   |   |   |   |--- Income > 116.50
|   |   |   |   |   |   |--- weights: [0.00, 45.05] class: 1
|   |   |--- Education_3 > 0.50
|   |   |   |--- Income <= 116.50
|   |   |   |   |--- CCAvg <= 1.10
|   |   |   |   |   |--- weights: [1.95, 0.00] class: 0
|   |   |   |   |--- CCAvg > 1.10
|   |   |   |   |   |--- ID <= 4505.50
|   |   |   |   |   |   |--- CCAvg <= 1.95
|   |   |   |   |   |   |   |--- ZIPCode_93 <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [0.60, 0.00] class: 0
|   |   |   |   |   |   |   |--- ZIPCode_93 > 0.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 0.85] class: 1
|   |   |   |   |   |   |--- CCAvg > 1.95
|   |   |   |   |   |   |   |--- ZIPCode_93 <= 0.50
|   |   |   |   |   |   |   |   |--- ID <= 3239.00
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 5.10] class: 1
|   |   |   |   |   |   |   |   |--- ID > 3239.00
|   |   |   |   |   |   |   |   |   |--- ID <= 4146.00
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.30, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- ID > 4146.00
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 0.85] class: 1
|   |   |   |   |   |   |   |--- ZIPCode_93 > 0.50
|   |   |   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |   |--- ID > 4505.50
|   |   |   |   |   |   |--- weights: [0.45, 0.00] class: 0
|   |   |   |--- Income > 116.50
|   |   |   |   |--- weights: [0.00, 52.70] class: 1
|   |--- Family > 2.50
|   |   |--- Income <= 113.50
|   |   |   |--- CCAvg <= 2.75
|   |   |   |   |--- Income <= 106.50
|   |   |   |   |   |--- weights: [3.90, 0.00] class: 0
|   |   |   |   |--- Income > 106.50
|   |   |   |   |   |--- Age <= 28.50
|   |   |   |   |   |   |--- weights: [1.35, 0.00] class: 0
|   |   |   |   |   |--- Age > 28.50
|   |   |   |   |   |   |--- Family <= 3.50
|   |   |   |   |   |   |   |--- weights: [0.90, 0.00] class: 0
|   |   |   |   |   |   |--- Family > 3.50
|   |   |   |   |   |   |   |--- Age <= 60.00
|   |   |   |   |   |   |   |   |--- ID <= 4176.00
|   |   |   |   |   |   |   |   |   |--- Age <= 35.00
|   |   |   |   |   |   |   |   |   |   |--- Education_3 <= 0.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.30, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |   |--- Education_3 > 0.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 0.85] class: 1
|   |   |   |   |   |   |   |   |   |--- Age > 35.00
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 4.25] class: 1
|   |   |   |   |   |   |   |   |--- ID > 4176.00
|   |   |   |   |   |   |   |   |   |--- weights: [0.15, 0.00] class: 0
|   |   |   |   |   |   |   |--- Age > 60.00
|   |   |   |   |   |   |   |   |--- weights: [0.30, 0.00] class: 0
|   |   |   |--- CCAvg > 2.75
|   |   |   |   |--- Age <= 57.00
|   |   |   |   |   |--- weights: [0.15, 11.90] class: 1
|   |   |   |   |--- Age > 57.00
|   |   |   |   |   |--- weights: [0.75, 0.00] class: 0
|   |   |--- Income > 113.50
|   |   |   |--- Age <= 66.00
|   |   |   |   |--- Income <= 116.50
|   |   |   |   |   |--- CCAvg <= 2.50
|   |   |   |   |   |   |--- weights: [0.45, 0.00] class: 0
|   |   |   |   |   |--- CCAvg > 2.50
|   |   |   |   |   |   |--- Age <= 60.50
|   |   |   |   |   |   |   |--- weights: [0.00, 5.10] class: 1
|   |   |   |   |   |   |--- Age > 60.50
|   |   |   |   |   |   |   |--- weights: [0.30, 0.00] class: 0
|   |   |   |   |--- Income > 116.50
|   |   |   |   |   |--- weights: [0.00, 130.90] class: 1
|   |   |   |--- Age > 66.00
|   |   |   |   |--- weights: [0.15, 0.00] class: 0
import pandas as pd
# Importance of features in the tree building (The importance of a feature is computed as the normalized total reduction of the criterion brought by that feature. It is also known as the Gini importance)
feature_importance_df = pd.DataFrame(estimator_2.feature_importances_, columns=["Imp"], index=X_train.columns)
sorted_feature_importance_df = feature_importance_df.sort_values(by="Imp", ascending=False)
print(sorted_feature_importance_df)
                         Imp
Income              0.593704
Education_2         0.136801
CCAvg               0.078498
Education_3         0.066939
Family              0.065630
ID                  0.016482
Age                 0.015917
CD_Account          0.011009
Securities_Account  0.004589
Mortgage            0.003723
ZIPCode_91          0.003320
ZIPCode_93          0.002744
CreditCard          0.000646
Online              0.000000
ZIPCode_92          0.000000
ZIPCode_94          0.000000
ZIPCode_95          0.000000
ZIPCode_96          0.000000
importances = estimator_2.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
Checking performance on test data
from sklearn.metrics import confusion_matrix
# Create the confusion matrix for the test data using the post-pruned tree
y_pred = estimator_2.predict(X_test)
print(confusion_matrix(y_test, y_pred))
decision_tree_tune_post_test = model_performance_classification_sklearn(estimator_2, X_test, y_test)
decision_tree_tune_post_test
# training performance comparison
models_train_comp_df = pd.concat(
[decision_tree_perf_train.T, decision_tree_tune_perf_train.T], axis=1,
)
models_train_comp_df.columns = ["Decision Tree sklearn", "Decision Tree (Pre-Pruning)"]
print("Training performance comparison:")
models_train_comp_df
| | Decision Tree sklearn | Decision Tree (Pre-Pruning) |
|---|---|---|
| Accuracy | 1.0 | 1.0 |
| Recall | 1.0 | 1.0 |
| Precision | 1.0 | 1.0 |
| F1 | 1.0 | 1.0 |
# test performance comparison
models_test_comp_df = pd.concat(
    [decision_tree_perf_test.T, decision_tree_tune_perf_test.T], axis=1
)
models_test_comp_df.columns = ["Decision Tree sklearn", "Decision Tree (Pre-Pruning)"]
print("Test performance comparison:")
models_test_comp_df
| | Decision Tree sklearn | Decision Tree (Pre-Pruning) |
|---|---|---|
| Personal_Loan | | |
| 0 | 0.900667 | 0.900667 |
| 1 | 0.099333 | 0.099333 |
Model performance comparison: the two models show similar performance on the test set. Training performance: both models fit the training data well, with near-identical metrics.
Model choice: since the models predict similarly, weigh secondary factors such as interpretability and prediction speed. Extra testing: run further experiments to check how the models generalize to new data and behave in real situations. Feature impact analysis: examine feature importances further to identify which variables drive predictions most, and use that to refine feature selection.
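One way to follow up on the feature-impact suggestion is permutation importance, which measures how much a metric drops when a feature's values are shuffled; unlike impurity-based importance, it is computed from the model's predictions and is less biased toward high-cardinality features such as ID. A minimal sketch on synthetic data (the column names below are illustrative, not from the bank dataset):

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X_syn = rng.normal(size=(300, 3))      # columns: informative, weak, noise
y_syn = (X_syn[:, 0] > 0).astype(int)  # label depends only on the first column

clf = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X_syn, y_syn)

# Shuffle each column in turn and measure the resulting drop in recall
result = permutation_importance(
    clf, X_syn, y_syn, scoring="recall", n_repeats=10, random_state=1
)
for name, imp in zip(["informative", "weak", "noise"], result.importances_mean):
    print(f"{name}: {imp:.3f}")
```
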
1. Create new features like customer demographics, transaction patterns, and account behavior to capture potential loan customers' characteristics effectively.
2. Conduct in-depth analysis to identify the key factors influencing customers' decision-making process regarding personal loans.
3. Build a predictive model using machine learning algorithms like logistic regression, decision trees, or random forests to predict the likelihood of a liability customer purchasing a personal loan.
4. Evaluate the model's performance using metrics like accuracy, precision, recall, and F1-score to ensure its effectiveness in identifying potential loan customers.
5. Utilize the model predictions to target marketing campaigns towards customers with a higher probability of purchasing a personal loan.
6. Segment customers based on their predicted likelihood of purchasing a loan to tailor marketing strategies and offers accordingly.
7. Regularly monitor and update the model with new data to ensure its relevance and accuracy in predicting potential loan customers.
By implementing these recommendations, AllLife Bank can enhance its marketing strategies, improve customer targeting, and increase the conversion rate of liability customers to personal loan customers, ultimately driving business growth and profitability.
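The targeting and segmentation recommendations can be sketched with `predict_proba`: score each customer and bucket them into campaign segments by predicted probability. The data, columns, and thresholds below are synthetic and purely illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the bank's customer data (hypothetical values)
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "Income": rng.normal(100, 40, 500).clip(10),
    "CCAvg": rng.normal(2, 1.5, 500).clip(0),
})
demo["Personal_Loan"] = ((demo["Income"] > 116) & (demo["CCAvg"] > 1)).astype(int)

clf = DecisionTreeClassifier(max_depth=4, random_state=1)
clf.fit(demo[["Income", "CCAvg"]], demo["Personal_Loan"])

# Score customers and bucket them into campaign segments by predicted probability
demo["loan_prob"] = clf.predict_proba(demo[["Income", "CCAvg"]])[:, 1]
demo["segment"] = pd.cut(
    demo["loan_prob"], bins=[-0.01, 0.3, 0.7, 1.0],
    labels=["low", "medium", "high"],
)
print(demo["segment"].value_counts())
```

Marketing can then prioritize the "high" segment, tailor offers for "medium", and deprioritize "low", as suggested in the recommendations.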